Identifying optimal incomplete phylogenetic data sets from sequence databases.
Authors
Abstract
We introduce a new method for identifying optimal incomplete data sets from large sequence databases based on the graph theoretic concept of alpha-quasi-bicliques. The quasi-biclique method searches large sequence databases to identify useful phylogenetic data sets with a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa. The utility of the quasi-biclique method is demonstrated on large simulated sequence databases and on a data set of green plant sequences from GenBank. The quasi-biclique method greatly increases the taxon and gene sampling in the data sets while adding only a limited amount of missing data. Furthermore, under the conditions of the simulation, data sets with a limited amount of missing data often produce topologies nearly as accurate as those built from complete data sets. The quasi-biclique method will be an effective tool for exploiting sequence databases for phylogenetic information and also may help identify critical sequences needed to build large phylogenetic data sets.
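The idea of tolerating "a specified amount of missing data while maintaining the necessary amount of overlap among genes and taxa" can be illustrated with a small membership check. The sketch below is not the authors' search algorithm. It assumes one common reading of an alpha-quasi-biclique: in the bipartite graph linking taxa to the genes sequenced for them, every selected taxon lacks at most alpha times the number of selected genes, and every selected gene lacks at most alpha times the number of selected taxa. The function names, the dictionary layout, and the toy taxon and gene labels are all hypothetical.

```python
from typing import Dict, Iterable, Set


def is_alpha_quasi_biclique(
    has_seq: Dict[str, Set[str]],
    taxa: Iterable[str],
    genes: Iterable[str],
    alpha: float,
) -> bool:
    """Check whether (taxa, genes) satisfies the assumed alpha-quasi-biclique rule.

    has_seq maps each taxon to the set of genes for which the database
    holds a sequence (hypothetical layout, chosen for illustration).
    """
    taxa = list(taxa)
    genes = list(genes)
    if not taxa or not genes:
        return False

    # Every selected taxon may lack at most alpha * |genes| of the selected genes.
    for t in taxa:
        missing = sum(1 for g in genes if g not in has_seq.get(t, set()))
        if missing > alpha * len(genes):
            return False

    # Every selected gene may be absent from at most alpha * |taxa| selected taxa.
    for g in genes:
        missing = sum(1 for t in taxa if g not in has_seq.get(t, set()))
        if missing > alpha * len(taxa):
            return False

    return True


def missing_fraction(
    has_seq: Dict[str, Set[str]], taxa: Iterable[str], genes: Iterable[str]
) -> float:
    """Fraction of empty cells in the taxa-by-genes matrix."""
    taxa = list(taxa)
    genes = list(genes)
    empty = sum(
        1 for t in taxa for g in genes if g not in has_seq.get(t, set())
    )
    return empty / (len(taxa) * len(genes))


# Toy database: which genes are available for which taxa (made-up data).
db = {
    "A": {"rbcL", "matK", "atpB"},
    "B": {"rbcL", "matK"},
    "C": {"rbcL", "atpB"},
    "D": {"rbcL"},
}
selected_taxa = ["A", "B", "C"]
selected_genes = ["rbcL", "matK", "atpB"]

print(is_alpha_quasi_biclique(db, selected_taxa, selected_genes, 0.0))  # False
print(is_alpha_quasi_biclique(db, selected_taxa, selected_genes, 0.5))  # True
print(missing_fraction(db, selected_taxa, selected_genes))  # 2 of 9 cells empty
```

With alpha set to 0 the check reduces to requiring a complete data set (a biclique). Raising alpha admits more taxa and genes at the cost of some empty cells, which is the trade-off the abstract describes.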
Similar resources
Terraces in phylogenetic tree space.
A key step in assembling the tree of life is the construction of species-rich phylogenies from multilocus--but often incomplete--sequence data sets. We describe previously unknown structure in the landscape of solutions to the tree reconstruction problem, comprising sometimes vast "terraces" of trees with identical quality, arranged on islands of phylogenetically similar trees. Phylogenetic amb...
TreeGeneBrowser: phylogenetic data mining of gene sequences from public databases
MOTIVATION Sequence databases represent an enormous resource of phylogenetic information, but there is a lack of tools for accessing that information in order to assess the amount of evolutionary information in these databases that may be suitable for phylogenetic reconstruction and for identifying areas of the taxonomy that are under-represented for specific gene sequences. RESULTS We have d...
Ortholog-Finder: A Tool for Constructing an Ortholog Data Set
Orthologs are widely used for phylogenetic analysis of species; however, identifying genuine orthologs among distantly related species is challenging, because genes obtained through horizontal gene transfer (HGT) and out-paralogs derived from gene duplication before speciation are often present among the predicted orthologs. We developed a program, "Ortholog-Finder," to obtain ortholog data set...
Obtaining maximal concatenated phylogenetic data sets from large sequence databases.
To improve the accuracy of tree reconstruction, phylogeneticists are extracting increasingly large multigene data sets from sequence databases. Determining whether a database contains at least k genes sampled from at least m species is an NP-complete problem. However, the skewed distribution of sequences in these databases permits all such data sets to be obtained in reasonable computing times ...
Refining phylogenetic hypotheses using chloroplast genomics and incomplete data sets in Lasthenia (Madieae, Asteraceae)
Walker, Joseph Frederic. M.S., Purdue University. August 2014. Refining Phylogenetic Hypotheses Using Chloroplast Genomics and Incomplete Data Sets in Lasthenia (Madieae, Asteraceae). Major Professor: Nancy C. Emery. The genus Lasthenia (Madieae, Asteraceae), consists of predominantly annual plant species that are largely endemic to the California Floristic Province of western North America and...
Journal:
- Molecular Phylogenetics and Evolution
Volume 35, Issue 3
Pages -
Publication date: 2005